{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 亚马逊Amazon手表商品信息爬取\n",
    "\n",
    "*  电子讲义设计者：刘瑜鹏 181013091\n",
    "\n",
    "### 项目参考：\n",
    "\n",
    "* [Scrapy基础](https://www.jianshu.com/p/0f64297fc912) \n",
    "\n",
    "* 云端[scrapinghub](https://app.scrapinghub.com/)部署实践\n",
    "\n",
    "* [B站实操参考](https://www.bilibili.com/video/BV1es411F73F?from=search&seid=15379492346649266533)\n",
    "\n",
    "\n",
    "### 建议：\n",
    "\n",
    "* 熟悉 [xpath 语法](https://www.w3cschool.cn/xpath/xpath-syntax.html)丶[xpath 节点](https://www.w3cschool.cn/xpath/xpath-nodes.html)\n",
    "\n",
    "* 使用 [xpath cheatsheet](https://devhints.io/xpath)\n",
    "\n",
    "* [Scrapy基础](https://www.jianshu.com/p/0f64297fc912) \n",
    "\n",
    "* 云端[scrapinghub](https://app.scrapinghub.com/)部署实践\n",
    "\n",
    "* [Log日志级别](https://www.cnblogs.com/lingduqianli/p/7589173.html)\n",
    "\n",
    "<br/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 简介\n",
    "\n",
    "### 数据加值宣言：\n",
    "\n",
    "本项目产出按购物平台及亚马逊Amazon挖掘的关于手表watch的数据，以解决人们上网购买手表却懒于查找的问题。利用scrapy框架对数据进行爬取，并爬取到23页445条数据，输出json文件，并导出以标题（title）、价格（price）、评分级别（pingfen）、评价人数（number）、链接（herf）、页数（page）为主的csv文件，且每一笔数据有类别正确的数据，目的是为了更为方便地读取信息。本项目成功部署至scrapyhub。\n",
    "\n",
    "### 数据最小可用产品\n",
    "\n",
    "#### MVP的数据加值：\n",
    "\n",
    "从本次挖掘的数据来看，以商品标题title的形式直接呈现本次挖掘到的商品名称，便于查阅；因为人们总有一种先看价格再决定是否购买的心理，本次数据将爬取到price价格，可以更清晰地让我们查阅到相关商品的信息，更快筛选出我们的心理可承受价格；评分级别pingfen以五分为满分的给分方式，让我们更快得出改商品在购买者心目中的口碑，便于我们筛选商品；评价人数对应评分级别，可从此处得知评分级别是否具有代表性；链接href，以此来查看自己心仪的商品，可从此链接了解到更多详情。最后是页数page，对应到网页的页数，让用户更为方便的查找商品。\n",
    "\n",
    "#### 挖掘参数：\n",
    "\n",
    "以K：对搜索商品的定义；Page：页面（翻页所需）；title：标题；price：价格；pingfen：评分级别；number评价人数；herf链接为参数和关键词爬取相关数据\n",
    "\n",
    "#### 思路方法及具体执行\n",
    "\n",
    "•\t方法选择scrapy，对亚马逊Amazon网站进行数据爬取，其中该网站数据存储在HTML中。之所以选择一scrapy框架搭建本次项目，一方面是scrapy框架搭建相对于requests、selenium方法来说更为便捷；另一方面，scrapy是当下主流的爬取方式，以此方式搭建项目更显专业。补充说明，以requests方法搭建项目需要书写大量代码，更为耗时间；以selenium方法看起来更为智能，但是需要保存好测试应用的位置，后续有人想用此代码还需要对其进行改动，不利于别人的学习。综合考虑，选择scrapy框架搭建本次项目。\n",
    "\n",
    "•\t先搭好scrapy的基本框架，在spider下的py文档写上具体实行爬取的代码。\n",
    "\n",
    "•\titems.py下定义需要爬取的字段；\n",
    "\n",
    "•\t接着编写pipeline(管道)，在此处添加json文件的输出方式\n",
    "\n",
    "•\t通过在setting.py中进行设置来配置logging:，并且在setting.py中加入“FEED_EXPORT_ENCODING = 'utf-8-sig'”代码，避免输出内容出现乱码的现象\n",
    "\n",
    "•\t最后在后台以“-o”方法输出csv文件\n",
    "\n",
    "•\t完善项目后上传至github/gitee，由github部署至scrapyhub。\n",
    "\n",
    "### 心得总结及感谢\n",
    "\n",
    "本次数据挖掘项目说难不难，说简单也不简单，主要是一开始对scrapy的书写过程不理解，以至于出现不知道怎么开始的局面，后面在同学的建议下，在B站找了教程，跟着老师一步一步操作后，就觉得挺简单的，只要弄清楚管道这些是怎么运转的，其实也就差不多了。我觉得B站是个自学的好地方，大家有空可以上去瞅瞅。本次项目在参考[B站教程]( https://www.bilibili.com/video/BV1es411F73F?from=search&seid=15379492346649266533)后进行修改得出，非常庆幸B站有如此多的大牛分享实操过程，也非常感谢他们的付出，因为有了他们的分享，我才能完成此次期末项目。\n",
    "\n",
    "\n",
    "* [公开可访位之代码URL连结-jupyter文档](https://github.com/Crayon2/Scrapy_amazon/blob/master/%E6%9C%9F%E6%9C%AB%E9%A1%B9%E7%9B%AE.ipynb)\n",
    "\n",
    "* [仓库url--github](https://github.com/Crayon2/Scrapy_amazon)\n",
    "\n",
    "* [仓库url--gitee](https://gitee.com/crayon-heimi/Scrapy_amazon/tree/master)\n",
    "\n",
    "* [scrapyhub部署信息](https://gitee.com/crayon-heimi/Scrapy_amazon/blob/master/README.md)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### scrapy操作--主要流程\n",
    " \n",
    "<img src=\"./Scrapy框架图.png\" >\n",
    "\n",
    "* 如果图片显示不出，请点击[此处查看](https://gitee.com/crayon-heimi/Scrapy_amazon/blob/master/Scrapy框架图.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### scrapy操作--前期准备\n",
    "\n",
    "* 在本地新建一个文档\n",
    "\n",
    "* 打开cmd，进入上方新建文档\n",
    "\n",
    "* scrapy startproject 文档名【你想创建项目的名称】\n",
    "\n",
    "* cd 文档名\n",
    "\n",
    "* scrapy genspider 主要书写文件名称 \"爬取页面URL\"\n",
    "\n",
    "<img src=\"./Scrapy.png\" >\n",
    " \n",
    "* 如果图片显示不出，请点击[此处查看](https://gitee.com/crayon-heimi/Scrapy_amazon/blob/master/Scrapy.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### scrapy操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 在spider下创建的py文档修改【示例中的watch.py】"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import time\n",
    "from urllib import parse\n",
    "import scrapy\n",
    "from lxml import etree\n",
    "from scrapy import  Request\n",
    "from scrapy.linkextractors import LinkExtractor\n",
    "from scrapy.spiders import CrawlSpider, Rule\n",
    "\n",
    "\n",
    "class WatchSpider(scrapy.Spider):\n",
    "    name = 'watch'\n",
    "    allowed_domains = ['amazon.com']  #cmd创立文档时写下的url，写错了也没关系\n",
    "    # 想爬取 网站的开始页\n",
    "    start_urls = ['https://www.amazon.com/s?k=%E6%89%8B%E8%A1%A8&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss']\n",
    "\n",
    "# 爬取单页的xpath\n",
    "\n",
    "    def parse(self, response):\n",
    "        time.sleep(2) #睡两秒，延长爬取时间\n",
    "        # 商品标题\n",
    "        titles=response.xpath('//div[@class=\"sg-col-inner\"]/div/h2//a[@class=\"a-link-normal a-text-normal\"]/span/text()').extract()\n",
    "        #商品价格\n",
    "        price =response.xpath('//div[@class=\"sg-col-inner\"]//div[@class=\"a-row\"]//span[@class=\"a-price\"]/span[@class=\"a-offscreen\"]/text()').extract()\n",
    "        #评价人数\n",
    "        number =response.xpath('//span[@class=\"a-size-base\"]/text()').extract()\n",
    "        #评分级别\n",
    "        pingfen =response.xpath('//span[@class=\"a-icon-alt\"]/text()').extract()\n",
    "        #页数\n",
    "        page =response.xpath('//li[@class=\"a-selected\"]/a/text()').extract()\n",
    "        \n",
    "\n",
    "        url_1 = 'https://www.amazon.com'\n",
    "        url_2= response.xpath('//div[@class=\"sg-col-inner\"]/div/h2/a/@href').extract()\n",
    "        \n",
    "        herf = [parse.urljoin(url_1,item) for item in url_2]  \n",
    "        \n",
    "        for item in zip(titles,price,pingfen,number,herf,page):\n",
    "            yield{\n",
    "                \"title\":item[0],\n",
    "                \"price\":item[1],\n",
    "                \"pingfen\":item[2],\n",
    "                \"number\": item[3],\n",
    "                \"herf\": item[4],              \n",
    "                \"page\": item[5]\n",
    "            }\n",
    "            \n",
    "         # 利用item规定好关键词排列顺序            \n",
    "        \n",
    "#爬取多页            \n",
    "            \n",
    "        page_url_1 = 'https://www.amazon.com'\n",
    "        page_url_2 =response.xpath('//div[@class=\"a-text-center\"]/ul[@class=\"a-pagination\"]/li[@class=\"a-last\"]/a/@href').extract_first()\n",
    "\n",
    "        if next!=None:\n",
    "             next_url = parse.urljoin(page_url_1, page_url_2)  #拼接链接，进行翻页爬取\n",
    "        \n",
    "             yield Request(next_url)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### items.py （单元文件设置）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scrapy\n",
    "\n",
    "\n",
    "class AmazonItem(scrapy.Item):\n",
    "    # define the fields for your item here like:\n",
    "    # name = scrapy.Field()\n",
    "    titles = scrapy.Field()\n",
    "    price = scrapy.Field()\n",
    "    pingfen = scrapy.Field()\n",
    "    number = scrapy.Field()\n",
    "    herf = scrapy.Field()\n",
    "    page = scrapy.Field()\n",
    "    \n",
    "    \n",
    "#定义你想抓取出来的关键词    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### pipelines.py （管道保存json文件）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy.exporters import JsonLinesItemExporter\n",
    "\n",
    "class AmazonPipeline:\n",
    "    def __init__(self):\n",
    "        self.fp = open('watch.json','wb')    #输出名字，格式为json。scrapy普遍采取保存为json文件\n",
    "        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')\n",
    "\n",
    "    def process_item(self, item, spider):\n",
    "        self.exporter.export_item(item)\n",
    "        return item\n",
    "    def close_spider(self,spider):\n",
    "        self.fp.close()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### settings.py (设置规则)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "BOT_NAME = 'amazon'\n",
    "\n",
    "SPIDER_MODULES = ['amazon.spiders']\n",
    "NEWSPIDER_MODULE = 'amazon.spiders'\n",
    "FEED_EXPORT_ENCODING = 'utf-8-sig'   #文字编码，避免爬取中文数据变成乱码\n",
    "\n",
    "# Obey robots.txt rules   定义规则\n",
    "ROBOTSTXT_OBEY = True\n",
    "\n",
    "LOG_LEVEL =\"WARN\"   #Log日志级别，详情可查看上面的链接-Log日志级别\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 输出csv格式表格"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "scrapy crawl watch -o watch.csv    #在后台输入scrapy crawl spider下创建的文件 -o 你自己想保存的名字.csv \n",
    "\n",
    "# 表格保存路径直接是在文件夹根目录下，不需要重新书写保存路径"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### scrapyhub部署信息"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 部署\n",
    "\n",
    "<img src=\"./success.png\" >\n",
    "\n",
    "* 如果图片显示不出，请点击[此处查看](https://gitee.com/crayon-heimi/Scrapy_amazon/blob/master/success.png)\n",
    "    \n",
    "<img src=\"./scrapyhub.png\" >\n",
    "\n",
    "* 如果图片显示不出，请点击[此处查看](https://gitee.com/crayon-heimi/Scrapy_amazon/blob/master/scrapyhub.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 运行\n",
    "\n",
    "<img src=\"./mining.png\" >\n",
    "\n",
    "* 如果图片显示不出，请点击[此处查看](https://gitee.com/crayon-heimi/Scrapy_amazon/blob/master/mining.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "284.444px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
